ComputeDomains: adjust task reconciliation behavior for large CD formation #732
Conversation
Force-pushed from 9524592 to 7815f9a.
```go
// whitespace to the left.
for _, ip := range slices.Sorted(maps.Keys(m.ipToDNSName)) {
	dnsname := m.ipToDNSName[ip]
	klog.Infof("%26s -> %s", dnsname, ip)
```
The current patch sorts the keys, and hence the IP addresses :)
```go
// perform a stable sort of IP addresses before writing them to the nodes
// config file.
if !maps.Equal(newIPs, previousIPs) {
	klog.Infof("IP set changed: previous: %v; new: %v", previousIPs, newIPs)
```
```diff
 if err := pm.updateNodeStatus(ctx, status); err != nil {
-	return fmt.Errorf("failed to update node status: %w", err)
+	return fmt.Errorf("pod update: failed to update note status in CD (%s): %w", status, err)
```
The wrapper (workqueue) does not enrich the error message with meaningful context, so I added the `pod update:` prefix here -- it makes it easier to understand what a log message means. Example:

```
I1119 22:10:21.531887 1 workqueue.go:197] Reconcile: pod update: failed to update note status in CD (Ready): simulated error 5 (attempt 5)
```
```go
// UpdateComputeDomainNodeInfo updates the Nodes field in the ComputeDomain with
// info about the ComputeDomain daemon running on this node. Upon success, it
// reflects the mutation in `m.mutationCache`.
func (m *ComputeDomainManager) UpdateComputeDomainNodeInfo(ctx context.Context, cd *nvapi.ComputeDomain) (rerr error) {
```
I felt like renaming this from `UpdateComputeDomainNodeInfo` to `EnsureNodeInfoInCD` after I repeatedly found myself slightly confused about the high-level responsibility of this method.
Force-pushed from 7815f9a to f8fceea.
```diff
 // fails and is retried, the delay grows exponentially starting from the
 // lower value up to the upper bound.
-workqueue.NewTypedItemExponentialFailureRateLimiter[any](250*time.Millisecond, 3000*time.Second),
+workqueue.NewTypedItemExponentialFailureRateLimiter[any](250*time.Millisecond, 3000*time.Millisecond),
```
```go
}

func DefaultCDDaemonRateLimiter() workqueue.TypedRateLimiter[any] {
	return NewJitterRateLimiter(workqueue.NewTypedItemExponentialFailureRateLimiter[any](5*time.Millisecond, 6000*time.Millisecond), 0.5)
```
I thought quite a bit about these numbers, but of course these are just an attempt to pick something meaningful -- we will see over time if and how we want to change method and parameters.
/ok to test f8fceea
User tested with the patch on 160 nodes. Convergence took 10 minutes. That is a good enough improvement compared to "infinity", and I think this is the signal we need today to proceed with the patch. Generally, we still go through a plethora of conflicts: the scaling behavior (the number of conflicts seen during the convergence process, as a function of the number of nodes contributing to the CD) is still O(N^2), since every successful write can invalidate the in-flight update attempts of all other nodes. That's not OK. We need to get this closer to linear, and I am sure that there is still a lot of room for improvement.
One thing that would help is if we had separate API server objects for each IMEX domain to put their status into (i.e. one per NVLink partition / clique). That way each write would only be competing with writes from the other nodes in the same partition/clique.
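A purely illustrative sketch of what such per-clique status objects could look like (type and field names below are hypothetical, not part of this PR or the current API):

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeInfo is a minimal stand-in for the per-node info a CD daemon reports.
type NodeInfo struct {
	Name      string `json:"name"`
	IPAddress string `json:"ipAddress"`
	Status    string `json:"status"`
}

// ComputeDomainCliqueStatus is a hypothetical per-clique object: one instance
// per NVLink partition / clique. Each node's status update then only competes
// with updates from the other nodes of the same clique, instead of with all
// N nodes of the ComputeDomain.
type ComputeDomainCliqueStatus struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// CliqueID identifies the NVLink partition / clique this object covers.
	CliqueID string `json:"cliqueID"`

	// Nodes holds the readiness info contributed by the CD daemons running
	// on the nodes of this clique.
	Nodes []NodeInfo `json:"nodes,omitempty"`
}
```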
Force-pushed from f8fceea to fa4e30a.
I also briefly thought about that. I think there are many different dimensions in the solution space that we can explore. Any kind of sharding on a per-clique level may be super useful. And/or server-side apply: https://kubernetes.io/docs/reference/using-api/server-side-apply/ (I think that has a lot of potential), and probably other strategies.
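A hedged sketch of what a server-side-apply-based status update could look like, using the dynamic client. The group/version/resource, field names, namespace, and the assumption that the nodes list is declared as a map-type list (so per-node entries merge instead of replacing each other) are all illustrative, not taken from this PR:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// applyNodeStatus submits only the fields owned by this node. The API server
// merges them based on the declared field manager, so two nodes updating
// their own entries do not run into full-object resourceVersion conflicts.
func applyNodeStatus(ctx context.Context, client dynamic.Interface, cdName, nodeName string) error {
	gvr := schema.GroupVersionResource{
		Group:    "resource.nvidia.com", // assumed API group
		Version:  "v1beta1",             // assumed version
		Resource: "computedomains",      // assumed resource name
	}

	patch := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "resource.nvidia.com/v1beta1",
		"kind":       "ComputeDomain",
		"metadata":   map[string]interface{}{"name": cdName},
		"status": map[string]interface{}{
			// Requires the nodes list to be a map-type list keyed by name
			// (x-kubernetes-list-map-keys) for per-entry merging.
			"nodes": []interface{}{
				map[string]interface{}{"name": nodeName, "status": "Ready"},
			},
		},
	}}

	_, err := client.Resource(gvr).Namespace("default").ApplyStatus(
		ctx, cdName, patch,
		metav1.ApplyOptions{FieldManager: "cd-daemon-" + nodeName, Force: true},
	)
	return err
}
```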
/ok to test fa4e30a
User feedback, round 4: 215 nodes, tested with the last commit on this branch. All CD daemons started within less than ten seconds, and we're down to ~four minutes of convergence time (see the attached screenshots for the timeline and the number of conflicts).

The four-minute convergence time coincides with the four-minute informer resync period chosen in one of the last few commits here. I inspected the logs for the 'worst-case' pod and indeed found that the informer resync once again triggered recognizing the NotReady -> Ready transition.

How is the "I am now ready" event (for every single CD daemon) distributed over time? The awk command line translates a log line timestamp into a number: the difference in seconds to the start of the convergence at 15:16:16. We then plot a histogram showing the distribution of those numbers (frequency to the right, time upwards, in seconds). That's a fairly good-looking distribution.

I think we should move ahead with this patch now. Separately, we should further investigate the eligibility of the normal informer-based event propagation for (not even that fast) detection of pod readiness changes. We're probably still doing something wrong in our informer/pipeline/event-handling setup.
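The awk command line itself isn't reproduced here; as a rough illustration of the same idea (in Go, with made-up timestamps and an assumed `HH:MM:SS` extraction from the logs), bucketing the per-daemon "ready" times relative to the convergence start could look like this:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

func main() {
	// Convergence start, as in the comment above.
	start, _ := time.Parse("15:04:05", "15:16:16")

	// Example "I am now ready" timestamps, one per CD daemon (made up here;
	// in practice these would be extracted from the pod logs).
	ready := []string{"15:16:31", "15:17:02", "15:17:05", "15:18:40", "15:19:59"}

	// Bucket into 30-second bins: seconds since start -> count.
	buckets := map[int]int{}
	for _, ts := range ready {
		t, err := time.Parse("15:04:05", ts)
		if err != nil {
			continue
		}
		buckets[int(t.Sub(start).Seconds())/30]++
	}

	// Print a simple text histogram (time downwards, frequency to the right).
	var keys []int
	for k := range buckets {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	for _, k := range keys {
		fmt.Printf("%4d s | %s\n", k*30, strings.Repeat("#", buckets[k]))
	}
}
```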
klueska left a comment:
A few nits, but looks good in general.
Any incoming pod update should terminate the retry loop initiated for a previously incoming pod update. The same for any incoming CD update. Any pod update refers to the same pod object, and any CD update refers to the same CD object. Explicitly use that by using hard-coded keys. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
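A minimal sketch of the hard-coded-key idea (not the PR's exact code; names and the rate limiter below are placeholders): because all pod updates refer to the same pod object, they can share a single queue key, so the workqueue's dedup-by-key behavior makes a fresh update collapse into, and supersede, a pending retry for an older one.

```go
package main

import (
	"k8s.io/client-go/util/workqueue"
)

const (
	podUpdateKey = "pod-update" // hard-coded: all pod updates share this key
	cdUpdateKey  = "cd-update"  // hard-coded: all CD updates share this key
)

func newQueue() workqueue.TypedRateLimitingInterface[any] {
	return workqueue.NewTypedRateLimitingQueue[any](
		workqueue.DefaultTypedControllerRateLimiter[any](),
	)
}

// onPodUpdate would be called from the pod informer's event handler. Adding
// the fixed key is deduplicated against an already-queued item, and Forget
// resets the per-key backoff so the new update does not inherit the retry
// delay accumulated by the previous one.
func onPodUpdate(q workqueue.TypedRateLimitingInterface[any]) {
	q.Forget(podUpdateKey)
	q.Add(podUpdateKey)
}

func main() {
	q := newQueue()
	onPodUpdate(q)
	item, _ := q.Get()
	defer q.Done(item)
	_ = item // the reconcile function for pod updates would run here
}
```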
For large CDs this makes it faster to identify changes from the log output. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This is less error-prone: if we treat `node == nil` generally as success, we may miss persisting a pod state transition in edge cases and for edge-case state transitions after the initial NotReady -> Ready transition. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
During reconciliation of pod and CD updates a number of log messages and error messages are flowing through the system, and this change makes it easier to understand which messages belong together and what is actually happening. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This reduces the amount of log volume on the default log level for large CDs. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This was meant to be three seconds, not 3000 seconds. This is a node-local retry and we can easily afford not backing off towards O(1 min) or further. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Change upper bound from 1000 s (17 minutes) to something much less.

For formation of a larger ComputeDomain (N nodes), many writers want to update the same API server object. The overall work that has to be done by the API server scales linearly with N: a certain number of updates (at least N, in the best case) is required. Hence, in the best case (perfect serialization, no conflicts) the overall time it takes to get all updates in (the ideal convergence time C) scales linearly with N. In the worst case, when individual updates always conflict with each other (are performed 'at the same time' against the same reference state), convergence is never achieved.

Without centralized coordination, backing off individual retriers is a way to spread out updates over time. The nature of the distribution of those back-offs governs how long the actual convergence takes compared to the ideal case C. The ideal case C is governed by the rate R at which the central entity can process updates.

If we naively back off exponentially without a sane upper bound, then we don't homogeneously spread the update load over time, but try to get fewer and fewer updates injected into the system as time progresses. The attempted update rate then falls far below R (the possible update rate). That makes convergence unnecessarily slow. If we do not back off enough, an opposite effect may occur: the global rate of retries accumulating at the central point (API server) may always exceed R, and hence thrash resources and slow things down compared to the theoretical update rate maximum (in the case of perfectly serialized updates). Hence, there is a sweet spot between both extrema. The position of that sweet spot strongly depends on R.

Summary:

1) We do not want to back off individual retriers too far, otherwise we operate at an update rate lower than necessary and artificially slow down the convergence process.

2) We need to back off individual retriers enough to prevent thrashing from slowing us and others down. This is critical for making sure the convergence time scales linearly with N (instead of, say, O(N**2)).

This patch primarily takes care of (1). For (2), in the future, we may want to further increase the upper bound after a certain amount of time (if e.g. a 5 second cap does not result in overall convergence after e.g. 30 minutes, it may be worth backing off further, to remove stress from the API server).

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
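To make the effect of a low cap concrete, here is a small sketch (numbers illustrative, not necessarily the final values in this PR) that prints the per-item delay sequence produced by the exponential failure rate limiter with a 250 ms base and a 5 s cap:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	rl := workqueue.NewTypedItemExponentialFailureRateLimiter[any](
		250*time.Millisecond, // base delay
		5*time.Second,        // upper bound: keeps the attempted update rate from collapsing
	)

	const item = "cd-update"
	for attempt := 1; attempt <= 8; attempt++ {
		// Each When() call counts as one more failure for this item and
		// returns the delay to wait before the next attempt.
		fmt.Printf("attempt %d: wait %v\n", attempt, rl.When(item))
	}
	// Prints 250ms, 500ms, 1s, 2s, 4s, and then stays capped at 5s, i.e. the
	// retry rate flattens out instead of decaying towards very long waits.
}
```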
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
The client-go rate limiters such as `ExponentialFailureRateLimiter` do not implement jitter. In a user's environment, formation of a CD across 144 nodes has shown that the absence of jitter results in significant retry attempt correlation across nodes -- even after ~10 retries -- resulting in otherwise preventable conflicts (and hence increased convergence time). That effect can be diminished by adding jitter, which should allow retries to de-correlate across nodes over time.

The JitterRL implementation provided by this patch is a simple, custom implementation that I validated with simulated errors and careful placement of log messages.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
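A minimal sketch of what such a jittering wrapper might look like (the actual JitterRL in this patch may differ in details; the 0.5 factor mirrors the value used in `DefaultCDDaemonRateLimiter` above):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// JitterRateLimiter wraps another rate limiter and randomizes each returned
// delay, de-correlating retry attempts across many nodes that fail at
// roughly the same time.
type JitterRateLimiter struct {
	inner  workqueue.TypedRateLimiter[any]
	factor float64 // 0.5 means each delay is scaled into [0.5, 1.5) of its value
}

func NewJitterRateLimiter(inner workqueue.TypedRateLimiter[any], factor float64) *JitterRateLimiter {
	return &JitterRateLimiter{inner: inner, factor: factor}
}

func (j *JitterRateLimiter) When(item any) time.Duration {
	d := j.inner.When(item)
	scale := 1 + j.factor*(2*rand.Float64()-1) // uniform in [1-factor, 1+factor)
	return time.Duration(float64(d) * scale)
}

func (j *JitterRateLimiter) Forget(item any)          { j.inner.Forget(item) }
func (j *JitterRateLimiter) NumRequeues(item any) int { return j.inner.NumRequeues(item) }

func main() {
	rl := NewJitterRateLimiter(
		workqueue.NewTypedItemExponentialFailureRateLimiter[any](5*time.Millisecond, 6*time.Second), 0.5)
	for i := 0; i < 5; i++ {
		fmt.Println(rl.When("cd-update"))
	}
}
```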
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
See issue NVIDIA#742. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Generally, output looks cleaner with that. Makes sorting lexicographically more meaningful (otherwise 10 comes right after 1). Should be OK even for upgraded systems as IMEX daemon nodes config and /etc/hosts are written in tandem. Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Sort by DNS name (map value). Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
Force-pushed from fa4e30a to 4df71bc.
Test suite passed for the most recent commit.
/cherry-pick release-25.8
🤖 Backport PR created for `release-25.8`.



Plus misc changes -- see commit messages.

Tests: see the test feedback rounds in the conversation above.